[SPARK-17244] Catalyst should not pushdown non-deterministic join conditions#14815
[SPARK-17244] Catalyst should not pushdown non-deterministic join conditions#14815sameeragarwal wants to merge 5 commits intoapache:masterfrom
Conversation
|
cc @brkyvz who found this bug |
| @@ -1386,15 +1386,17 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper { | |||
| object PushPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper { | |||
| /** | |||
| * Splits join condition expressions into three categories based on the attributes required | |||
There was a problem hiding this comment.
It sounds like this PR is targeting both join conditions and filter predicates. Could you update the code comment of this line and the title of the PR?
|
Test build #64436 has finished for PR 14815 at commit
|
|
Test build #64437 has finished for PR 14815 at commit
|
|
Thanks @gatorsmile, I address your comments. |
|
Looks pretty good to me! Could you also add one more test case for testing the following scenario: I am just afraid the others might break it in the future. Thanks! |
|
That's a good idea. More generally, I think the best way to fix these class of problems is to unify all predicate pushdown logic at a common place. |
|
Test build #64496 has finished for PR 14815 at commit
|
|
In 2.0, we combined the PPD rules. Now, only two PPD rules left, |
|
LGTM pending testing |
|
Test build #64500 has finished for PR 14815 at commit
|
|
LGTM. Merging to master and branch 2.0. |
…ditions ## What changes were proposed in this pull request? Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that. ## How was this patch tested? A new test in `FilterPushdownSuite` that checks catalyst behavior for both deterministic and non-deterministic join conditions. Author: Sameer Agarwal <sameerag@cs.berkeley.edu> Closes #14815 from sameeragarwal/constraint-inputfile. (cherry picked from commit 540e912) Signed-off-by: Yin Huai <yhuai@databricks.com>
| * to evaluate them. | ||
| * Splits join condition expressions or filter predicates (on a given join's output) into three | ||
| * categories based on the attributes required to evaluate them. Note that we explicitly exclude | ||
| * on-deterministic (i.e., stateful) condition expressions in canEvaluateInLeft or |
There was a problem hiding this comment.
typo: non-deterministic
There was a problem hiding this comment.
good eye! can you please fold this change into one of your open PRs :) ?
What changes were proposed in this pull request?
Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that.
How was this patch tested?
A new test in
FilterPushdownSuitethat checks catalyst behavior for both deterministic and non-deterministic join conditions.